ABSTRACT

We have developed a method for the fast and accurate determination of
stellar population parameters, intended for application to high resolution
galaxy spectra. The method is based on an optimization technique that
combines active learning with an instance-based machine learning algorithm.

We tested the method by retrieving the star-formation history and dust
content of "synthetic" galaxies with a wide range of S/N ratios. The "synthetic"
galaxies were constructed using two different grids of high resolution theoretical
population synthesis models.

The results of our controlled experiment show that our method can estimate
with good speed and accuracy the parameters of the stellar populations that make
up the galaxy, even for very low S/N input. For a spectrum with S/N = 5 the typical
average deviation between the input and fitted spectrum is less than $10^{-5}$. Additional
improvements are achieved using prior knowledge.

Key words: galaxies: fundamental parameters - galaxies: stellar content - method: 
data analysis - method: numerical - method: statistical. 



1 INTRODUCTION 

The availability of large astronomical spectroscopic surveys
with moderate spectral resolution, such as the 2dF (Colless
et al. 2001) or the Sloan Digital Sky Survey (SDSS, York et
al. 2000; Stoughton et al. 2002), has prompted the computation
of new grids of high resolution spectral synthesis models,
creating the need for highly efficient methods for determining
the intrinsic physical parameters of a large number of
galaxies. There are three intrinsic galactic parameters that
are particularly important for studies of cosmological evolution:
the star formation history, the chemical composition history
and the mass distribution of the stellar populations. The
importance of accurate knowledge of these parameters
for cosmological studies and for the understanding of galaxy
formation and evolution cannot be overestimated. Template
fitting has been widely used to estimate the distribution of
age and metallicity from spectral data. Although this technique
achieves good results, it is expensive in terms of computing
time (and is therefore best applied to relatively small samples,
e.g. Mayya et al. 2004), and the results are in general
compromised by low signal-to-noise data
(Kauffmann et al. 2003; Tremonti et al. 2004; Cid Fernandes
et al. 2005).

Until recently, synthesis models provided either low res- 
olution and a range of metallicities using theoretical atmo- 
spheres, or medium resolution at basically solar abundance 
with the use of empirical stellar spectra. A major problem 
with theoretical atmospheres used to be that the sampling 
was coarser than the line broadening observed even in the 
most massive galaxies. For massive ellipticals with velocity
dispersions of up to 400 km s$^{-1}$, sampling of about 7 Å px$^{-1}$
or better in the optical region is needed to represent
their spectra with minimum loss of information. For dwarf
galaxies or globular clusters, with velocity dispersions all the
way down to 5-10 km s$^{-1}$, the optimum sampling is around
0.1 Å px$^{-1}$.

Clearly, comparing data obtained with sampling of 70
km s$^{-1}$, like the SDSS, with models with sampling of 1200
km s$^{-1}$ at 5000 Å is not satisfactory, in the sense that much
information associated with atomic lines and even relatively
narrow molecular bands will be washed out by the large
undersampling. On the other hand, by smoothing or filtering
the high frequencies in the data, a more compact data set that
is easier and faster to process is created (Heavens, Jimenez
and Lahav 2000; Heavens et al. 2004).

To overcome these problems we have explored new
methods that, while exploiting the high resolution achieved
by recent synthesis models, maximize both speed and accuracy
in the determination of stellar population parameters.

Minimum distance methods and the closely related chi- 
square minimization present two significant drawbacks as 
classification tools. Firstly, they depend crucially on the 
choice of standard objects to be used; in many problems, 
it is impossible to select a representative member for each 
class or each combination of parameters. Secondly, it is diffi- 
cult to include information regarding intra-class variability, 
as the only information provided is a typical representative 
member of the class. Machine learning approaches, on the 
other hand, use training data that include many represen- 
tative examples for each class, which makes the selection 
of standards unnecessary and provides information to de- 
termine the features that discriminate members of different 
classes. 

Machine learning algorithms have been shown to sig- 
nificantly outperform minimum distance methods and chi- 
square minimization in a number of astronomical applica- 
tions, including stellar classification and determination of 
stellar atmospheric parameters. For instance, Bailer-Jones
(1996) showed that a committee of simple feedforward neural
networks yields an error reduction of about 50 percent
compared with minimum distance when applied to the task
of stellar classification; similar results were reported by
Gulati and Gupta (1997). Bailer-Jones (1996) also mentions
that, in regions where training data are sparse, the
performance advantage of neural networks decreases. Thus,
methods that automatically add training data to undersampled
regions, such as the one we present in this paper, are highly
desirable.

In this, the first paper of a series, we test a technique
that approximates non-linear multidimensional functions using
a small initial training set and then, by means of active
learning, enlarges this training set as needed according to the
elements of the test set. This method has been shown to outperform
traditional instance-based learning algorithms on the problem
of interferogram analysis (Fuentes and Solorio 2004).

Here we present the results of a series of controlled
experiments showing that this method can quickly and accurately
retrieve the physical parameters of "simulated" galaxies,
even at a very low S/N level. Our method also takes advantage
of prior domain knowledge, which is used to further
increase the accuracy of the results obtained. In a forthcoming
paper (Solorio et al., in preparation) we apply this
methodology to large data sets of galaxy spectra to characterize
their stellar population fabric.



2 TESTING THE METHOD WITH 
SYNTHETIC GALAXIES 

Before blindly applying a new method to real data it is
reasonable to test the procedure critically in a controlled
environment. A crucial aspect is that the validity of the test
increases as the test conditions approach the real case. For
this reason we have created synthetic galaxies that are as
realistic as possible and necessary for this first step in our
research, and have applied our method to a reference set of
"synthetic" high resolution galaxy spectra. To minimize
systematics associated with the use of a particular model, we
have used two different sets of new high resolution spectral
synthesis models (for this test only the solar metallicity ones) to
generate the reference set of synthetic galaxy spectra (Bertone
et al. 2004: Padova models; Gonzalez-Delgado et al. 2005:
Granada models). The high spectral resolution of the models
allows them to be used in the study of narrow absorption
lines and of the spectral evolution of the intense line profiles
over a wide range of ages. It should be emphasized that
our goal is to test the effectiveness of the method on two
different sets of models in order to assess its robustness.
We are not trying to determine the respective merits of the
models, and our experiments provide no evidence in that
regard. We will address this point in a forthcoming paper.

The Granada models are Single Stellar Population
(SSP) synthesis models calculated for ages ranging from 1 Myr to
17 Gyr using the Padova and Geneva stellar evolutionary
tracks and their own stellar atmosphere library, with a spectral
sampling of 0.3 Å and a wavelength coverage of 3000-7000 Å
(Martins et al. 2005). Of the various metallicities available,
we use only the solar metallicity models in this first work.
The synthetic stellar library has been computed with the
latest stellar atmospheres: non-LTE models for the hot stars
and LTE line-blanketed models for the cool stars. A
full description of the models is given by Gonzalez-Delgado
et al. (2005).

A second set of integrated high resolution spectra, which
we will call the Padova set, was kindly computed for us by
A. Bressan (private communication) according to the
prescriptions outlined in Bressan, Chiosi and Fagotto (1994).
Spectral fluxes along the Bertelli et al. (1994) isochrones
were integrated adopting a Salpeter initial mass function
(IMF) between 0.15 and 120 $M_\odot$. Kurucz high resolution
(R = 50000) synthetic stellar spectra from 3500 Å to 4500 Å
were kindly provided by L. Rodriguez, M. Chavez and E.
Bertone before publication (Rodriguez-Merino et al. 2005).
The red end of the spectra was completed using their 20 Å
resolution models from 4500 to 7000 Å (Bressan et al. 1994).
The spectral resolution of the SSPs was finally degraded
to R = 10000.



2.1 Synthetic galaxies 

To construct the spectrum of the synthetic galaxies we com- 
bined three different populations corresponding to young, in- 
termediate age and old single-age stellar populations (SSP) 
in varying proportions. To each population we added inde- 
pendent dust attenuation (extinction). The effects of adding 
noise are discussed in the next section. 

Let $f(\lambda)$ be the energy flux emitted by a star or group of
stars at wavelength $\lambda$. The flux detected by a measuring
device is then $d(\lambda) = f(\lambda)(1 - e^{-r\lambda})$, where $r$ is a constant that
defines the amount of reddening in the observed spectrum
and depends on the size and density of the dust particles in
the interstellar medium.

A synthetic galactic spectrum, $g(\lambda)$, can be built given
$c_1, c_2, c_3$, the relative contributions of young, intermediate
age and old stellar populations, respectively, their reddening
parameters $r_1, r_2, r_3$, and the ages of the populations
$a_1, a_2, a_3$:

$$g(\lambda) = \sum_{i=1}^{3} c_i \, s(a_i, \lambda) \left(1 - e^{-r_i \lambda}\right) \qquad (1)$$

0. Let $T$ be the test spectra
1. $S = \{\}$
2. For $i = 1$ to $n$
   2.1. Generate random parameter vector $p = [c_1, c_2, c_3, r_1, r_2, r_3, a_1, a_2, a_3]$
   2.2. Generate spectrum $s$ according to $p$
   2.3. $S = S \cup \{\langle (s), (r_1, r_2, r_3) \rangle\}$
3. While $T \neq \{\}$ do:
   3.1. Build $C$, an ensemble of approximators using learning
        algorithm LWLR and training data $S$
   3.2. For every test spectrum $t \in T$
      3.2.1. Use $C$ to predict the reddening parameters $r_1, r_2, r_3$ of $t$
      3.2.2. $\mathrm{error}(q^*) = \infty$
      3.2.3. For every triple $(a_1, a_2, a_3) \in \{3 \times 10^6\} \times \{10^8, 3 \times 10^8, 5 \times 10^8, 8 \times 10^8\} \times \{10^9, 2 \times 10^9, 3 \times 10^9, 5 \times 10^9, 10^{10}\}$
             $R = \begin{bmatrix} s(a_1, \lambda_1)(1 - e^{-r_1 \lambda_1}) & \cdots & s(a_1, \lambda_m)(1 - e^{-r_1 \lambda_m}) \\ s(a_2, \lambda_1)(1 - e^{-r_2 \lambda_1}) & \cdots & s(a_2, \lambda_m)(1 - e^{-r_2 \lambda_m}) \\ s(a_3, \lambda_1)(1 - e^{-r_3 \lambda_1}) & \cdots & s(a_3, \lambda_m)(1 - e^{-r_3 \lambda_m}) \end{bmatrix}$
             $[c_1, c_2, c_3] = t (R^T R)^{-1} R^T$
             $q = [c_1, c_2, c_3, r_1, r_2, r_3, a_1, a_2, a_3]$
             Generate spectrum $g$ according to $q$
             $\mathrm{error}(q) = \sum_{\lambda} (g(\lambda) - t(\lambda))^2$
             If $\mathrm{error}(q) < \mathrm{error}(q^*)$
                $q^* = q$
                $g^* = g$
      3.2.4. If $\mathrm{error}(q^*) <$ threshold
                output $\langle t, q^* \rangle$
                $T = T - \{t\}$
             Else $S = S \cup \{\langle (g^*), (r_1, r_2, r_3) \rangle\}$

Table 1. Pseudo-code of our Active Learning Algorithm (described
in Section 3).



where $g(\lambda)$ is the energy flux detected at wavelength $\lambda$ and
$s(a_i, \lambda)$ is the flux emitted by a stellar population of age $a_i$
at wavelength $\lambda$.
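
As an illustration of equation (1), the following minimal sketch builds a synthetic spectrum from three SSP spectra. The function names `attenuate` and `synthetic_spectrum` and the flat toy SSP spectra are assumptions made for this example; they are not taken from the Granada or Padova model grids.

```python
import numpy as np

def attenuate(flux, wavelengths, r):
    """Dust term of the text: d(lambda) = f(lambda) * (1 - exp(-r * lambda))."""
    return flux * (1.0 - np.exp(-r * wavelengths))

def synthetic_spectrum(c, r, ssp_fluxes, wavelengths):
    """Equation (1): sum_i c_i * s(a_i, lambda) * (1 - exp(-r_i * lambda)).

    c, r        : length-3 sequences (relative contributions, reddening parameters)
    ssp_fluxes  : array of shape (3, m), one SSP spectrum s(a_i, .) per row
    wavelengths : array of shape (m,)
    """
    c = np.asarray(c, dtype=float)
    r = np.asarray(r, dtype=float)
    # Row i holds s(a_i, lambda) * (1 - exp(-r_i * lambda)).
    attenuated = ssp_fluxes * (1.0 - np.exp(-np.outer(r, wavelengths)))
    return c @ attenuated

# Toy input (assumption): flat SSP spectra stand in for real model spectra.
wavelengths = np.linspace(3000.0, 7000.0, 5)
ssp_fluxes = np.ones((3, wavelengths.size))
g = synthetic_spectrum([0.2, 0.3, 0.5], [1e-4, 2e-4, 3e-4], ssp_fluxes, wavelengths)
```

With real models, each row of `ssp_fluxes` would be the SSP spectrum of the chosen age sampled on the common wavelength grid.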

The task of analyzing an observed galaxy spectrum $t$
consists of finding the parameter vector
$q = [c_1, c_2, c_3, r_1, r_2, r_3, a_1, a_2, a_3]$ that minimizes:

$$\mathrm{error}(q) = \sum_{\lambda} \left(t(\lambda) - g(\lambda)\right)^2 \qquad (2)$$




Clearly, $c_1, c_2, c_3$ have to be non-negative and sum
to 1. Realistic values of $r_1, r_2, r_3$ lie in the narrow range
$[1 \times 10^{-5}, 6 \times 10^{-4}]$, and using only a few discrete values for
$a_1$, $a_2$ and $a_3$ normally suffices for a reasonable approximation.
In particular, for our experiments we consider stellar
population ages $a_1 \in \{3 \times 10^6\}$, $a_2 \in \{10^8, 3 \times 10^8, 5 \times 10^8, 8 \times 10^8\}$
and $a_3 \in \{10^9, 2 \times 10^9, 3 \times 10^9, 5 \times 10^9, 10^{10}\}$.
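
The constraints above can be made concrete with a short sketch that draws one random parameter vector and evaluates the error of equation (2). The sampling distributions (a flat Dirichlet for the contributions, uniform reddening, uniform choice of age) and the names `random_parameters` and `fit_error` are assumptions for illustration; the text only states that the parameters are chosen at random.

```python
import numpy as np

# Discrete age grids quoted in the text (young, intermediate, old).
AGE_GRID = (
    (3e6,),
    (1e8, 3e8, 5e8, 8e8),
    (1e9, 2e9, 3e9, 5e9, 1e10),
)
R_MIN, R_MAX = 1e-5, 6e-4  # range of realistic reddening parameters

def random_parameters(rng):
    """Draw (c, r, a) satisfying the constraints on contributions, reddening and ages."""
    c = rng.dirichlet(np.ones(3))           # non-negative and sums to 1 (assumed distribution)
    r = rng.uniform(R_MIN, R_MAX, size=3)   # assumed uniform within the quoted range
    a = np.array([rng.choice(grid) for grid in AGE_GRID])
    return c, r, a

def fit_error(t, g):
    """Equation (2): sum over wavelengths of the squared residual."""
    return float(np.sum((np.asarray(t) - np.asarray(g)) ** 2))

c, r, a = random_parameters(np.random.default_rng(0))
```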



3 THE METHOD 

In the application proposed here, galactic spectral analysis,
the algorithm estimates the ages of three SSPs, their individual
contributions to the total light and their reddening from
a high resolution (or, equivalently, high dimensionality) input
spectrum. In general, all learning algorithms, such as neural
networks, C4.5 (Quinlan 1993) and locally weighted regression,
face the well known curse of dimensionality (Bellman
1957), which essentially states that the number of training
examples needed to approximate a function accurately grows
exponentially with the dimensionality of the task. To circumvent
the curse of dimensionality, we partition the problem
into three subproblems, each of which can be solved by a
different method. The key point is that dust extinction is a
non-linear effect that takes long to estimate. If the values of
the reddening parameters were known, it would be possible
simply to search over the possible combinations of values for
the ages of the stellar populations (a total of $1 \times 4 \times 5 = 20$
for the Granada models, and a total of 16 for the Padova models),
and for each combination of ages find the contributions that
best fit the observation using least squares. The best overall fit
would then be the combination of ages and contributions that
resulted in the best match to the test spectrum. Thus, the
crucial subproblem to be solved is that of determining the
reddening parameters.
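
The linear part of the problem described above can be sketched as follows: for fixed reddening parameters, loop over the age combinations and solve a least-squares problem for the contributions. This is a minimal sketch under stated assumptions: `fit_ages_and_contributions` is a hypothetical name, the SSP library is a toy set of Gaussian bumps rather than real model spectra, and `numpy.linalg.lstsq` is used for the least-squares step without enforcing non-negative contributions.

```python
import itertools
import numpy as np

AGE_GRID = (
    (3e6,),
    (1e8, 3e8, 5e8, 8e8),
    (1e9, 2e9, 3e9, 5e9, 1e10),
)

def fit_ages_and_contributions(t, r, ssp_library, wavelengths, age_grid=AGE_GRID):
    """Return (error, ages, contributions) of the best fit to t for fixed reddening r."""
    best = None
    for ages in itertools.product(*age_grid):
        # Rows of R are the attenuated SSP spectra for this age triple.
        R = np.array([
            ssp_library[a] * (1.0 - np.exp(-r_i * wavelengths))
            for a, r_i in zip(ages, r)
        ])
        c, *_ = np.linalg.lstsq(R.T, t, rcond=None)   # least-squares contributions
        error = float(np.sum((c @ R - t) ** 2))
        if best is None or error < best[0]:
            best = (error, ages, c)
    return best

# Toy library (assumption): one Gaussian bump per age, not a real model grid.
wavelengths = np.linspace(3000.0, 7000.0, 200)
all_ages = [a for grid in AGE_GRID for a in grid]
centres = np.linspace(3200.0, 6800.0, len(all_ages))
ssp_library = {a: 0.1 + np.exp(-((wavelengths - mu) / 500.0) ** 2)
               for a, mu in zip(all_ages, centres)}

true_ages = (3e6, 5e8, 2e9)
true_c = np.array([0.2, 0.3, 0.5])
true_r = (1e-4, 2e-4, 3e-4)
t = sum(c_i * ssp_library[a] * (1.0 - np.exp(-r_i * wavelengths))
        for c_i, a, r_i in zip(true_c, true_ages, true_r))
error, ages, c = fit_ages_and_contributions(t, true_r, ssp_library, wavelengths)
```

Because the search covers only 20 age combinations and each step is a three-column least-squares solve, this stage is cheap compared with estimating the reddening.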

Predicting the reddening parameters from spectra is a
difficult non-linear optimization problem, especially for
noisy spectra. We propose to solve it using an iterative
active learning algorithm that learns the function from
spectra to reddening parameters. In each iteration, the
algorithm uses its training set to build an approximator that
predicts the reddening parameters of the spectra in the test
set. Once the algorithm has predicted these parameters, it
uses them to find the combination of ages and contributions
that yields the best match to each observed spectrum. From
these parameters we can generate the corresponding spectrum
and compare it with the spectrum under analysis. If
they are a close match, the parameters found by the
algorithm are taken as correct; if not, we add the newly
generated training example (the predicted parameters and their
corresponding spectrum) to the training set and proceed to
a new iteration. Since this type of active learning adds to
the training set examples that are progressively closer to
the points of interest, the errors are guaranteed to decrease
in every iteration until convergence is attained. The
criterion for halting the iterative process can be an
error threshold or a maximum number of iterative steps.

An outline of the algorithm, in the form of pseudo-code,
is given in Table 1. In steps 1 and 2 we build an initial training
set $S$ containing $n$ spectra (the attributes), generated
for randomly chosen parameter vectors (the target function)
by applying equation (1). Step 3 forms the main loop, in which we
attempt to obtain the parameters that best match the
spectra under analysis (set $T$); this step is repeated until a
satisfactory fit has been found for every spectrum in the test
set. First, in step 3.1, an approximator $C$ is built from $S$ using
an ensemble of locally weighted linear regression (LWLR)
learners; LWLR is a well-known instance-based learning
algorithm (Atkeson et al. 1997). Using $C$, we obtain candidate
reddening parameters $[r_1, r_2, r_3]$ for each spectrum in the test
set $T$; this is the non-linear part of the problem (step 3.2.1).
Given the candidate reddening parameters, in step 3.2.3 we find
the ages of the stellar populations $[a_1, a_2, a_3]$ and their relative
contributions $[c_1, c_2, c_3]$ using a combination of exhaustive search
and least squares fitting. For each of the 20 possible combinations
of ages we find the relative contributions that best
match the spectrum under analysis using a pseudo-inverse
computation, and then choose among the 20 combinations
of ages and corresponding contributions the one that minimizes
the residuals (equation 2). In step 3.2.4 we simply
test whether the parameter vector results in a satisfactory fit.
If the error for the best approximation, computed as in
equation (2), is smaller than a set threshold, the algorithm outputs the
set of parameters found for that spectrum and removes it
from the test set; if the error is not small enough, it adds
the new training example to the training set and continues
the process.
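
The control flow of the main loop can be sketched generically. In this sketch `predict` stands for the reddening approximator and `solve` for the age and contribution search; both are passed in as callables and, like the name `active_learning` and the one-parameter toy model used to exercise the loop, are assumptions made for illustration rather than the implementation used in the experiments.

```python
import numpy as np

def active_learning(test_spectra, S_x, S_y, predict, solve, threshold, max_iter=50):
    """Iterate until every test spectrum is fitted or max_iter passes are done.

    S_x, S_y              : lists holding the training spectra and their reddening
                            parameters (extended in place)
    predict(X, y, t)      : candidate reddening parameters for spectrum t
    solve(t, r)           : (error, q, g), the best parameters q and model spectrum g
                            for fixed reddening r
    """
    pending = dict(enumerate(test_spectra))       # the test set T
    results = {}
    for _ in range(max_iter):
        if not pending:
            break
        X, y = np.array(S_x), np.array(S_y)       # step 3.1: training data for this pass
        for idx in list(pending):
            t = pending[idx]
            r = predict(X, y, t)                   # step 3.2.1
            error, q, g = solve(t, r)              # step 3.2.3
            if error < threshold:                  # step 3.2.4
                results[idx] = q
                del pending[idx]
            else:
                S_x.append(g)                      # new example near the point of interest
                S_y.append(r)
    return results, pending

# Toy problem (assumption): one reddening parameter and a flat intrinsic spectrum.
lam = np.linspace(3000.0, 7000.0, 50)

def model(r):
    return 1.0 - np.exp(-r * lam)

def solve(t, r):
    g = model(r)
    return float(np.sum((g - t) ** 2)), r, g

def invert_first_pixel(X, y, t):
    # Stand-in predictor that ignores the training set and inverts the toy model.
    return -np.log(1.0 - t[0]) / lam[0]

spectra = [model(2e-4), model(4e-4)]
results, pending = active_learning(spectra, [], [], invert_first_pixel, solve, threshold=1e-12)
```

In the setting of this paper, `predict` would be the ensemble of LWLR learners and `solve` the exhaustive age search with least squares fitting.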

It should be pointed out that the active learning algo- 
rithm is independent of the choice of base learning algorithm 
used to predict the reddening parameters. Any algorithm 
that is suitable to predict real-valued target functions from 
real-valued attributes could be used. In this work we use
an ensemble of locally weighted linear regression (LWLR)
learners, but others, such as K-nearest-neighbours, could have
been applied. In the following two sections we briefly present the
ideas behind our chosen base learning algorithm.

3.1 Ensembles 

An ensemble of classifiers is a set of classifiers whose in- 
dividual decisions are combined in some way, normally by 
voting. In order for an ensemble to work properly, individual 
members of the ensemble need to have uncorrelated errors 
and an accuracy higher than random guessing. There are 
several methods for building ensembles. One of them, which 
is called bagging (Breiman 1996), consists of manipulating 
the training set. In this technique, each member of the en- 
semble has a training set consisting of m examples selected 
randomly with replacement from the original training set 
of m examples (Dietterich 2000). Another technique similar 
to bagging manipulates the attribute set. Here, each mem- 
ber of the ensemble uses a different subset randomly chosen 
from the attribute set. More information concerning ensemble
methods, such as boosting and error-correcting output
coding, can be found in Dietterich (2000). The technique
used for building an ensemble is chosen according to the 
learning algorithm used, which in turn is determined by the 
learning task. In the work presented here, we use the tech- 
nique that randomly selects subsets of attributes. 
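
A minimal sketch of an ensemble built on random attribute subsets is given below. For real-valued targets the member predictions are averaged here instead of voted on; this choice, the function name `random_subspace_ensemble` and the nearest-neighbour base learner in the toy usage are assumptions for illustration.

```python
import numpy as np

def random_subspace_ensemble(X, y, base_predict, n_members, n_attributes, rng):
    """Return a predictor that averages base learners trained on random attribute subsets.

    base_predict(X_sub, y, x_sub) must return the prediction of the base learner
    for query x_sub given training data (X_sub, y).
    """
    subsets = [rng.choice(X.shape[1], size=n_attributes, replace=False)
               for _ in range(n_members)]

    def predict(x_query):
        preds = [base_predict(X[:, idx], y, x_query[idx]) for idx in subsets]
        return np.mean(preds, axis=0)   # averaging replaces voting for regression

    return predict

def nearest_neighbour(X_sub, y, x_sub):
    # Toy base learner (assumption): output of the single closest training example.
    return y[np.argmin(np.linalg.norm(X_sub - x_sub, axis=1))]

X = np.outer([0.0, 1.0, 2.0], np.ones(10))   # three training examples, ten attributes
y = np.array([0.0, 10.0, 20.0])
predict = random_subspace_ensemble(X, y, nearest_neighbour, n_members=5,
                                   n_attributes=4, rng=np.random.default_rng(0))
value = predict(np.full(10, 1.1))
```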

3.2 Locally Weighted Regression

Locally Weighted Regression (LWR) belongs to the family
of instance-based learning algorithms, which includes algorithms
such as the basic K-nearest neighbour and radial basis
functions (Powell 1987). In contrast to most other learning
algorithms, which use their training examples to construct
explicit global representations of the target function,
instance-based learning algorithms simply store some or all 
of the training examples and postpone any generalization 
effort until a new instance must be classified. They can thus 
build query-specific local models, which attempt to fit the 
training examples only in a region around the query point. 
In this work we use a linear model around the query point 
to approximate the target function. 

Given a query point $x_q$, to predict its output parameters
$y_q$ we find the $k$ examples in the training set that are closest
to it, and assign to each of them a weight given by the inverse
of its distance to the query point: $w_i = 1/|x_q - x_i|$. Let $W$, the
weight matrix, be a diagonal matrix with entries $w_1, \ldots, w_k$.
Let $X$ be a matrix whose rows are the vectors $x_1, \ldots, x_k$, the
input parameters of the examples in the training set that are
closest to $x_q$, with the addition of a "1" in the last column.
Let $Y$ be a matrix whose rows are the vectors $y_1, \ldots, y_k$,
the output parameters of these examples. Then the weighted
training data are given by $Z = WX$ and the weighted target
function is $V = WY$. Then we use the estimator for the
target function $y_q = x_q^T (Z^T Z)^{-1} Z^T V$.

Thus, locally weighted linear regression is very similar 
to least-squares linear regression, except that the error terms 
used to derive the best linear approximation are weighted by 
the inverse of their distance to the query point. Intuitively, 
this yields much more accurate results than standard linear 
regression because the assumption that the target function is 
linear does not hold in general, but is a good approximation 
when only a small size neighborhood is considered. 
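
The estimator above can be sketched in a few lines. This is a minimal sketch, not the code used for the experiments: the name `lwlr_predict` and the small constant `eps` that guards against a zero distance are assumptions, the query point is extended with a trailing 1 to match the extra column of $X$, and `numpy.linalg.lstsq` is used in place of forming $(Z^T Z)^{-1}$ explicitly (both give the same solution when $Z$ has full column rank).

```python
import numpy as np

def lwlr_predict(X_train, Y_train, x_q, k, eps=1e-12):
    """Locally weighted linear regression prediction for a single query point.

    X_train : (n, d) inputs, Y_train : (n, p) outputs, x_q : (d,) query point.
    """
    dist = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dist)[:k]                  # the k closest training examples
    w = 1.0 / (dist[nearest] + eps)                 # inverse-distance weights w_i
    X = np.hstack([X_train[nearest], np.ones((k, 1))])   # "1" in the last column
    Z = w[:, None] * X                              # Z = W X
    V = w[:, None] * Y_train[nearest]               # V = W Y
    beta, *_ = np.linalg.lstsq(Z, V, rcond=None)    # solves the weighted least squares
    return np.append(x_q, 1.0) @ beta

# Toy check (assumption): an exactly linear target is recovered by the local model.
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(20, 2))
Y_train = (2.0 * X_train[:, 0] - X_train[:, 1] + 3.0)[:, None]
y_q = lwlr_predict(X_train, Y_train, np.array([0.3, 0.7]), k=6)
```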



4 DISCUSSION 

In all the experiments reported here we used the following
procedure. First we generated a random set of 200 galactic
spectra with their corresponding parameters. This set was
then randomly divided into two disjoint subsets: one subset
of 50 galactic spectra was used for training, and
the remaining 150 formed the test set. This procedure
was repeated 10 times, and we report here the overall
mean results.

In the first set of experiments our objective was to determine
empirically the differences between the active learning
procedure and a traditional ensemble of LWLR. As mentioned
previously, the ensembles were constructed by randomly
selecting a subset of the attributes. To make the comparison
objective, both methods used the same attribute subsets
and an ensemble of size 5. In Figure 1 we show the distribution
of prediction errors for intermediate and old ages using
the Granada models. This error is measured as the distance
in logarithmic steps between the real age and the predicted
one. We can see that even though the traditional ensemble of
LWLR performs well, our active algorithm achieves higher
accuracy. Figure 2 shows the error distributions for the
prediction of the relative contributions $c_1$, $c_2$ and $c_3$ of
each age population, also for the Granada models. We can
see that for the active algorithm the central bars are higher
than those of an ensemble of LWLR. Error distributions for the
prediction of the reddening parameters are shown in Figure 3.
Comparable results in all the experiments were obtained us- 
ing the Padova models, in Tables [3] 21 and [2] we present the 
more discrepant results. 

Figures 4 and 5 show graphical comparisons between a
test spectrum and the spectra reconstructed using traditional
LWLR and our active learning technique, for both sets of
models. The residuals are always smaller than 3 percent for the
LWLR method and clearly much smaller for the active learning
technique. The fact that the active learning technique
outperformed the traditional ensemble of LWLR was not
surprising. Although both are based on the same learning
algorithm, LWLR, the training sets from which the predictions
are computed are different. The main difference between
these two techniques lies in the iterative process of
the active learning algorithm. In each iteration, the active
learning algorithm augments its training set with new examples
that allow it to better approximate the observed
spectra. This iterative process continues until a suitable


A method for the Determination of Stellar Population Parameters 5 











Figure 1. Distribution of errors in the age prediction of inter- 
mediate and old populations using the Granada models. Error in 
age prediction is measured as the distance in logarithmic steps be- 
tween the age of the test spectrum and the predicted age. Figure 
(a), intermediate age, and (b), old, are the predictions of a tra- 
ditional LWLR ensemble. Figures (c) and (d) are the predictions 
of our algorithm for the same ages and test spectra. 

Figure 2. Distribution of prediction errors in the relative contribution
parameters $c_1$, $c_2$ and $c_3$ (see Section 2.1), using the
Granada models. Figures (a) to (c) are the predictions using an
ensemble of LWLR, for young, intermediate and old populations;
figures (d) to (f) are predictions using our active learning
algorithm.



solution is found for each spectrum in the test set. As the 
traditional ensemble of LWLR lacks this iterative process, it 
will output the best predictions it can reach using only the 
original training set. 



4.1 The effect of noise 

The results presented above are very encouraging. However, 
the data used in those experiments were noise free. We are 

Figure 3. Distribution of prediction errors in the reddening
parameters $r_1$, $r_2$ and $r_3$ (see Section 2.1), using the Granada
models. Figures (a) to (c) are the predictions for young, intermediate
and old populations using an ensemble of LWLR; figures (d) to
(f) are predictions using our active learning algorithm.

Figure 4. Graphical comparison of results using the Granada
models. Figure (a) from top to bottom and shifted by a constant
to aid visualization: original test spectrum, spectrum recovered
using ensemble of LWLR and spectrum recovered using active
learning. Figures (b) and (c) show, in the same scale, the residuals
between test and predicted spectra in the same listed order.

Figure 5. Same as Figure 4 but using the Padova models. Figure
(a) from top to bottom and shifted by a constant to aid visualization:
original test spectrum, spectrum recovered using ensemble
of LWLR and spectrum recovered using active learning. Figures
(b) and (c) show, in the same scale, the residuals between test
spectrum and predicted spectra in the same listed order. The red
end of the spectra was completed using Padova 20 Å resolution
models from 4500 to 7000 Å (Bressan et al. 1994).



aware that noisy data pose a more realistic evaluation of our 
algorithm, given that in real data analysis noise is always 
present. Astronomical spectral analysis is no exception to 
this rule. For this reason, we have performed a set of exper- 
iments aimed at exploring the noise-sensitivity of our active 
learning algorithm. We performed the same procedure
described previously, except that this time we added to the
test data Gaussian noise with zero mean and a standard
deviation of one. We experimented with three different
signal-to-noise (S/N) ratios: 5, 30 and 100, corresponding to bad,
normal and good data respectively. Here we present only
results for the lower S/N levels, given that as the noise level
decreases the prediction errors become more similar to those
of the noiseless experiments.
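
One possible way to produce noisy test spectra at a chosen S/N is sketched below. The text does not state how the unit-variance Gaussian noise is scaled to a given S/N, so the scaling used here (noise standard deviation equal to the mean flux divided by S/N) and the name `add_noise` are assumptions for illustration only.

```python
import numpy as np

def add_noise(flux, snr, rng):
    """Add zero-mean Gaussian noise scaled so that mean(flux) / sigma equals snr."""
    sigma = np.mean(flux) / snr                     # assumed definition of S/N
    return flux + sigma * rng.standard_normal(flux.shape)

rng = np.random.default_rng(0)
flux = np.full(200000, 10.0)                        # flat toy spectrum (assumption)
noisy = add_noise(flux, snr=5, rng=rng)
```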

As a first stage in the treatment of noisy data we used a
procedure involving standard principal component analysis
(PCA). PCA seeks a set of M orthogonal vectors v and their
associated eigenvalues k which best describe the distribution
of the data. This module takes as input the training set and
finds its principal components (PC). The noisy test data are
projected onto the space defined by the first 20 PC, which
were found to account for about 99% of the variance in the
set, and the magnitudes of these projections are used as
attributes for the algorithm. Experiments with a larger number
of PC (up to 150) showed no significant improvement in the
results.
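
The projection step can be sketched with a singular value decomposition of the mean-centred training set. The function names `fit_pca` and `project`, the mean-centring and the rank-2 toy data are assumptions for illustration; in the experiments described here the first 20 components would be kept.

```python
import numpy as np

def fit_pca(train, n_components):
    """Return the mean, the first principal components and their variance fractions."""
    mean = train.mean(axis=0)
    _, s, vt = np.linalg.svd(train - mean, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)             # fraction of variance per component
    return mean, vt[:n_components], explained[:n_components]

def project(spectra, mean, components):
    """Project spectra onto the principal components; the results are the attributes."""
    return (spectra - mean) @ components.T

# Toy data (assumption): 40 "spectra" of 30 pixels lying in a 2-dimensional subspace.
rng = np.random.default_rng(0)
basis = rng.standard_normal((2, 30))
train = rng.standard_normal((40, 2)) @ basis
mean, components, explained = fit_pca(train, n_components=2)
attributes = project(train, mean, components)
```

Noisy test spectra are projected with the `mean` and `components` obtained from the training set, so the noise in the test data does not influence the basis.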

Figure 6 shows the error distribution in the age prediction
using noisy data (S/N = 30) with an ensemble of LWLR
and with active learning, using the Granada models. In the case of
intermediate age prediction, both algorithms achieve almost
identical errors. In contrast, for the prediction of old populations
the active learning algorithm slightly outperforms the
ensemble of LWLR. It is important to note that the central
peak contains more than 60% of the results, while the
+1 and -1 bins include about 20% of the cases. For our method,
about 85% of the cases show an error in the age determination
that is equal to or smaller than one age step. The prediction of
the relative contributions, presented in Figure 7, is not as peaked
as the age prediction, but a substantial fraction of the cases
still has a small error. Our method in this case shows a moderate
improvement with respect to the LWLR. The same behaviour
can be observed in Figure 8, where the error distributions in
the prediction of the reddening parameters $r_1$, $r_2$ and $r_3$ are
presented. Results for the Padova theoretical models are similar
to those for the Granada models, although the improvement
achieved by our active learning technique is much higher in
the case of the Padova models, especially for the parameters
of the old populations.

Another set of figures presents results of experiments with
very noisy data, using S/N = 5. For the Granada models the
distributions of prediction errors are shown in Figures 9, 10
and 11. It is remarkable that even with low quality (S/N = 5)
data the algorithm produces such a good estimate of the
population ages. For our method, in about 80% of the cases the
error in the age determination is equal to or less than one age
step. However, it is evident that in some particular cases the
active technique was unable to improve accuracy because of the
high level of noise. For instance, for the Granada models the
prediction of the reddening parameter for young populations,
$r_1$, presented higher error rates using our algorithm than using
a traditional LWLR ensemble. In the estimation of reddening for
intermediate and old populations, however, the opposite occurred:
the traditional approach was outperformed by our algorithm. Our
algorithm also achieved higher accuracy in the estimation
of the relative contribution parameters. For the Padova models,
better results were achieved by the active algorithm in the
majority of cases; only in one case, the age prediction
of old populations, did the active algorithm have slightly higher
errors. In Figures 12 and 13 we show graphical comparisons
between test spectra and the spectra reconstructed using an LWLR
ensemble and our active learning algorithm with S/N = 5.
Comparing these figures with the results obtained for noiseless
data, we can say that the improvement in the fit provided by
the active technique is smaller with very noisy data, although
the advantage of the technique is still significant at the lowest
S/N ratio.

4.2 Lick Indices 



One remarkable aspect found in the experiments with 
noise included is that even when using a large number of PC 
the residuals showed relatively high peaks in some specific 
narrow spectral regions. Surprisingly several of these high 
residual regions coincide with the central band of the Lick 

Figure 6. Distribution of errors in the age prediction of intermediate
and old populations using the Granada models and an
S/N=30. Error in age prediction is measured as the distance in 
logarithmic steps between the age of the test spectrum and the 
predicted age. Figure (a), intermediate age, and (b), old, are the 
predictions of a traditional LWLR ensemble. Figures (c) and (d) 
are the predictions of our algorithm for the same ages and test 
spectra. 

Figure 8. Distribution of prediction errors in the reddening
parameters $r_1$, $r_2$ and $r_3$ (see Section 2.1) using the Granada models
and an S/N=30. Figures (a) to (c) are the predictions using an 
ensemble of LWLR for young, intermediate and old populations; 
figures (d) to (f) are predictions using our active learning algo- 
rithm. 

Figure 7. Distribution of prediction errors in the relative contribution
parameters $c_1$, $c_2$ and $c_3$ (see Section 2.1) using the
Granada models and an S/N=30. Figures (a) to (c) are the pre- 
dictions using an ensemble of LWLR for young, intermediate and 
old populations; figures (d) to (f) are predictions using our active 
learning algorithm. 



indices (listed in Table |5J, see Figure [TTI where the Lick in- 
dices are superimposed on the residuals of the reconstructed 
spectrum. In other words, giving equal weights to all pixels 
(or fluxes) produced larger residuals located in these narrow 
regions. 

We opted to explore whether prior knowledge about 
the Lick indices can help machine learning algorithms to 
provide a more accurate prediction. Thus, we experimented 
using two different approaches aimed at giving more influence 
to the central bands of the Lick indices. In the first 
approach we discarded most of each spectrum, keeping only 
the flux information corresponding to the central bands of 
the Lick indices. The learning algorithm thus predicts the 
reddening parameters using only this reduced subset of 
fluxes. In a similar way, the contribution of ages is 
estimated using the same subset of fluxes. Figures 15, 16 
and 17 show a comparison of error distributions between 
active learning when using the original data and active 
learning when using the Lick indices for the Granada models. 
We present here only the results using very noisy data 
(S/N=5), given that previous experiments showed higher error 
rates for this scenario. For the Padova models, using this 
prior knowledge did not yield higher accuracy in the case of 
an LWLR ensemble; predictions from active learning using 
the original data are more accurate. However, when using 
the Lick indices and active learning, the reddening 
parameters are estimated better, as is the relative 
contribution of ages. In the case of the Granada models the 
best results were achieved by active learning using the 
original data. 

Figure 9. Distribution of errors in the age prediction of 
intermediate and old populations using the Granada models and an 
S/N=5. Figure (a), intermediate age, and (b), old, are the 
predictions of a traditional LWLR ensemble. Figures (c) and (d) 
are the predictions of our algorithm for the same ages and test 
spectra. 

Figure 10. Distribution of errors in the prediction of the relative 
contribution parameters c1, c2 and c3 (see Section 2.1) using the 
Granada models and an S/N=5. Figures (a) to (c) are the 
predictions using an ensemble of LWLR for young, intermediate and 
old populations; figures (d) to (f) are predictions using our active 
learning algorithm. 

Figure 11. Distribution of prediction errors in the reddening 
parameters r1, r2 and r3 (see Section 2.1) using the Granada models 
and an S/N=5. Figures (a) to (c) are the predictions using an 
ensemble of LWLR for young, intermediate and old populations; 
figures (d) to (f) are predictions using our active learning 
algorithm. 

Figure 12. Graphical comparison of results using the Granada 
models and noisy data, ratio S/N=5. Figure (a), from top to 
bottom and shifted by a constant to aid visualization: noisy test 
spectrum, spectrum recovered using an ensemble of LWLR and 
spectrum recovered using active learning. Figure (b) shows the 
residuals of the reconstructed spectrum using an ensemble of LWLR 
and figure (c) shows the corresponding residuals using the active 
learning technique. 
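The first approach described above amounts to masking each spectrum so that only pixels inside the central bands survive. The following is a minimal sketch of that masking step, not the implementation used in our experiments: the band limits are a few entries copied from Table 2, while the wavelength grid, the random "spectra" and the function names `band_mask` and `keep_band_fluxes` are hypothetical and serve only as illustration.

```python
import numpy as np

# Central bands (in Angstrom) for a few indices, copied from Table 2.
CENTRAL_BANDS = {
    "Lick_Fe4383": (4370.375, 4421.625),
    "Lick_Fe5015": (4979.000, 5055.250),
    "Lick_Mg2": (5155.375, 5197.875),
    "Lick_NaD": (5878.625, 5911.125),
}


def band_mask(wavelength, bands):
    """Boolean mask that is True for pixels inside any central band."""
    mask = np.zeros(wavelength.shape, dtype=bool)
    for lo, hi in bands.values():
        mask |= (wavelength >= lo) & (wavelength <= hi)
    return mask


def keep_band_fluxes(wavelength, spectra, bands=CENTRAL_BANDS):
    """Discard every pixel that lies outside the central bands."""
    return spectra[..., band_mask(wavelength, bands)]


# Hypothetical wavelength grid and random fluxes, for illustration only.
wavelength = np.linspace(3000.0, 7000.0, 20000)
spectra = np.random.default_rng(0).random((5, wavelength.size))
reduced = keep_band_fluxes(wavelength, spectra)
```

The reduced array keeps one row per spectrum and only the columns that fall inside the listed bands; the same mask has to be applied to the training spectra and to the test spectrum so that the learning algorithm compares like with like.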

The other approach for incorporating prior knowledge 
consists of increasing the relevance of the Lick indices. By 
doing so, differences in the Lick indices of the data will have 
more weight than differences through the rest of the 
spectrum; this will be reflected when LWLR selects the 
closest examples to the test spectrum (see Subsection 3.2). To 
do this we multiplied the energy fluxes in the wavelengths 
corresponding to the Lick indices by a constant k. That is, 
fluxes in regions defined by Lick indices were deemed to be 
k times more important than pixels in other regions. The 
value k = 4 was set experimentally with a 10-fold 
cross-validation procedure. We present results using very noisy 
data (S/N=5). These results are similar to the results previously 
discussed. We find that while for the Padova models the best 
results come from active learning with prior knowledge, for 
the Granada models this is not the case and the best results 
come from active learning and the original data. Figures 15 
to 20 present the error distributions of these experiments. This 
whole topic will be further investigated in a forthcoming 
paper, where the method is applied to real data. 
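To illustrate why multiplying the band fluxes by k changes which examples are selected as closest, the sketch below applies the weighting to a toy query and two toy training spectra and compares the nearest neighbour before and after. It assumes a plain Euclidean distance for the neighbour search; the helper names `weight_bands` and `nearest_neighbours`, the ten-pixel grid and the single band are hypothetical.

```python
import numpy as np


def weight_bands(wavelength, spectra, bands, k=4.0):
    """Multiply fluxes inside the given bands by k; leave the rest unchanged."""
    weights = np.ones(wavelength.shape)
    for lo, hi in bands:
        weights[(wavelength >= lo) & (wavelength <= hi)] = k
    return spectra * weights


def nearest_neighbours(query, training, n=3):
    """Indices of the n training spectra closest to the query (Euclidean)."""
    distances = np.linalg.norm(training - query, axis=1)
    return np.argsort(distances)[:n]


# Toy data: ten pixels, one band covering pixels 3 and 4.
wavelength = np.arange(4000.0, 4010.0)
bands = [(4003.0, 4004.0)]
query = np.zeros(10)

train_a = np.zeros(10)
train_a[3] = 0.5       # differs from the query only inside the band
train_b = np.zeros(10)
train_b[6:10] = 0.4    # differs from the query only outside the band
training = np.vstack([train_a, train_b])

plain = nearest_neighbours(query, training, n=1)
weighted = nearest_neighbours(
    weight_bands(wavelength, query, bands),
    weight_bands(wavelength, training, bands),
    n=1,
)
```

Without weighting, the spectrum that differs inside the band is the closest one (distance 0.5 against 0.8); with k = 4 its distance grows to 2.0 and the other spectrum is selected instead. The weighting therefore penalizes mismatches in the index bands during neighbour selection, which is the effect sought.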

Using prior knowledge did not yield meaningful 
improvements; moreover, for some parameters the error 
increased when incorporating prior knowledge. In another 
attempt to improve results with very noisy data we carried out 
another set of experiments. This time we built an ensemble 
using the predictions from the three approaches: active 
learning using the original data, active learning using the 
fluxes corresponding to the Lick indices and active learning 
with more weight given to fluxes around the central bands of 
the Lick indices. All the ensemble predictions are then 
computed as the average of the predictions from each approach. 

Figure 13. Same as Figure 12 but using the Padova models and 
noisy data, ratio S/N=5. Figure (a), from top to bottom and 
shifted by a constant to aid visualization: noisy test spectrum, 
spectrum recovered using an ensemble of LWLR and spectrum 
recovered using active learning. Figure (b) shows the residuals of 
the reconstructed spectrum using an ensemble of LWLR and figure (c) 
shows the corresponding residuals using the active learning 
technique. 
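The combination step described above is a plain average over the three predictors. A minimal sketch follows; the parameter vectors are made-up numbers chosen only to show the arithmetic, and the function name `ensemble_average` is hypothetical.

```python
import numpy as np


def ensemble_average(predictions):
    """Average parameter vectors predicted by several approaches."""
    return np.mean(np.stack(predictions, axis=0), axis=0)


# Made-up predictions of (age step, relative contribution, reddening).
from_original_data = np.array([1.0, 0.2, 0.5])
from_band_fluxes = np.array([2.0, 0.4, 0.5])
from_weighted_bands = np.array([3.0, 0.6, 0.5])

combined = ensemble_average(
    [from_original_data, from_band_fluxes, from_weighted_bands]
)
```

Here `combined` is the element-wise mean of the three vectors. An unweighted mean treats the three approaches as equally reliable, which is the choice stated in the text.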

These results were the most accurate ones, even with 
high levels of noise; they are presented in Figures 21 to 23. 
These figures show a marginal improvement with respect to 
those of Figures 9 to 11. A graphical comparison is presented 
in Figures 24 and 25. It may be argued that the inclusion 
of constant Gaussian noise in the synthetic spectrum will 
produce a low S/N in those regions with lower signal and 
that this will preferentially affect the Lick indices. While 
some of this is present for the deepest features, it cannot be 
a major effect for the large majority of the Lick indices, 
where the flux in the central band changes by less than 
20 percent on average with respect to the side bands. The 
improvement in the concentration of results is clearly 
illustrated in Table 4, where the central bin frequency 
increases substantially with the inclusion of prior knowledge. 
In general, the results obtained from both the Padova and the 
Granada models support the conclusion that the best method 
when dealing with low S/N data seems to be the combination of 
an ensemble and prior knowledge. 

Figure 14. Residuals between a test and a predicted spectrum; 
the horizontal bands above the residuals show the Lick indices. 
We split the spectrum to aid visualization: the top of the figure 
shows residuals in the blue part while the bottom shows the red 
part residuals. 

Figure 15. Distribution of errors in the age prediction of 
intermediate and old populations for the Granada models using an 
S/N=5. Figure (a), intermediate age, and (b), old, are the 
predictions of active learning and the original data. Figures (c) 
and (d) are the predictions of active learning using only the 
fluxes of the central lines of the Lick indices. 
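The argument about constant Gaussian noise can be checked with a small numerical example. The sketch below assumes that the noise standard deviation is fixed from the mean flux of the spectrum as sigma = mean(flux) / (S/N); this convention, the flat toy spectrum and the 20 percent absorption feature are assumptions made for illustration and are not taken from our experimental setup.

```python
import numpy as np


def add_constant_noise(flux, snr, rng):
    """Add Gaussian noise with a single sigma for all pixels.

    Assumed convention: sigma = mean(flux) / snr.
    """
    sigma = flux.mean() / snr
    return flux + rng.normal(0.0, sigma, size=flux.shape), sigma


# Flat toy continuum with a hypothetical feature 20 percent deep.
flux = np.ones(1000)
flux[400:420] = 0.8

noisy, sigma = add_constant_noise(flux, snr=5.0, rng=np.random.default_rng(1))

# Per-pixel S/N implied by a constant sigma.
local_snr = flux / sigma
ratio = local_snr[410] / local_snr[0]
```

Because sigma is the same everywhere, the local S/N scales with the flux, so a band whose flux is 20 percent below its surroundings has a local S/N that is lower by the same 20 percent. Under this assumption the degradation inside most index bands is modest, in line with the statement above.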

Results are similar for both sets of theoretical models. 
However, we can point out some interesting differences in 
the experimental results. For instance, when using noiseless 
data the method achieves slightly higher prediction 
accuracies for the Granada models. Another difference, 
mentioned previously, is that for the Granada models prior 
knowledge does not seem to be very useful by itself. However, 
the only method that yields better results for these models 
than active learning with the original data is the combination 
of the three predictions: original data plus the two methods 
for using prior knowledge. In contrast, for the Padova models 
both methods for using prior knowledge improved prediction 
accuracy. It should be emphasized that these differences 
between the models are not significant, and we do not 
consider that they can be thought of as evidence of the 
correctness of the models, but rather as a reflection of the 
relation the spectrum has to the set of parameters in each 
model. Results on both models show that the ensemble of 
classifiers combining different forms of incorporating prior 
knowledge is the best alternative, especially when the data 
have high levels of noise. 

Name           Index Begin   Index End 
B&H_CNB        3810.0        3910.0 
HKratioK       3920.0        3945.0 
HKratioH       3955          3980 
Hd             4080.0        4120.0 
Lick_CN1       4143.375      4178.375 
B&H_CaI        4215.0        4245.0 
Lick_Ca4227    ...           ... 
Lick_G4300     4282.625      4317.625 
B&H_G          4285.0        4315.0 
Hg             4320.0        4360.0 
Lick_Fe4383    4370.375      4421.625 
Lick_Ca4455    4453.375      4475.875 
Lick_Fe4531    4515.500      4560.500 
Lick_C4668     ...           4721.500 
...            4830.0        ... 
Lick_Hb        4848.875      4877.625 
Lick_Fe5015    4979.000      5055.250 
Lick_Mg2       5155.375      5197.875 
Lick_Fe5270    5247.375      5287.375 
Lick_Fe5335    5314.125      ... 
Lick_NaD       5878.625      5911.125 

Table 2. The table was constructed based on the Sloan index 
list. We removed the duplicate indices, keeping the Lick indices 
where possible. We also added indices for the Balmer lines Hα, 
Hγ and Hδ. The HK ratio index was decomposed into two 
bands. By index here we mean only the central band and not the 
continuum side bands. 

Figure 16. Distribution of prediction errors in the relative 
contribution parameters c1, c2 and c3 (see Section 2.1) for the 
Granada models using an S/N=5. Figures (a) to (c) are the 
predictions of active learning and the original data for young, 
intermediate and old populations respectively. Figures (d) to (f) 
are the predictions of active learning using only the fluxes of 
the central lines of the Lick indices. 

Figure 17. Distribution of prediction errors in the reddening 
parameters r1, r2 and r3 (see Section 2.1) for the Granada models 
using an S/N=5. Figures (a) to (c) are the predictions of active 
learning and the original data for young, intermediate and old 
populations respectively. Figures (d) to (f) are the predictions 
of active learning using only the fluxes of the central lines of 
the Lick indices. 

Figure 18. Distribution of prediction errors in the age prediction 
of intermediate and old populations for the Granada models using 
an S/N=5. Figure (a), intermediate age, and (b), old, are the 
predictions of active learning and the original data. Figures (c) 
and (d) are the predictions of active learning with the fluxes of 
the central lines of the Lick indices magnified by a constant k = 4. 

Figure 19. Distribution of prediction errors in the relative 
contribution parameters c1, c2 and c3 (see Section 2.1) for the 
Granada models using an S/N=5. Figures (a) to (c) are the 
predictions of active learning and the original data for young, 
intermediate and old populations respectively. Figures (d) to (f) 
are the predictions of active learning with the fluxes of the 
central lines of the Lick indices magnified by a constant k = 4. 



5 CONCLUSIONS 

We have presented in this work an optimization algorithm that 
can estimate with high accuracy the age distributions, mixtures 
and reddening of the stellar populations in galaxies. 
The algorithm achieves convergence by iteratively creating 
new data points that lie in the vicinity of the query point. 
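The iterative scheme summarized above can be sketched on a toy problem. The code below is an illustration under strong assumptions and not the implementation used in this work: `forward_model` is a hypothetical two-parameter stand-in for a population synthesis model, the locally weighted linear regression uses a Gaussian kernel with a small ridge term for numerical stability, and new training points are drawn from a Gaussian of shrinking width around the best estimate found so far. All names and default values are hypothetical.

```python
import numpy as np

X_GRID = np.linspace(0.2, 1.0, 5)  # toy "wavelength" grid


def forward_model(theta):
    """Toy stand-in for a synthesis model; theta = (c, r)."""
    c, r = theta
    return c * np.exp(-r * X_GRID) + (1.0 - c)


def lwlr_predict(query, spectra, params, k=20, ridge=1e-8):
    """Locally weighted linear regression from spectrum to parameters."""
    dist = np.linalg.norm(spectra - query, axis=1)
    idx = np.argsort(dist)[:k]                 # k closest training spectra
    h = dist[idx].max() + 1e-12                # adaptive kernel width
    w = np.exp(-0.5 * (dist[idx] / h) ** 2)    # Gaussian kernel weights
    X = np.hstack([np.ones((idx.size, 1)), spectra[idx]])
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X + ridge * np.eye(X.shape[1]),
                           XtW @ params[idx])
    return np.concatenate([[1.0], query]) @ beta


def active_fit(query, n_init=200, n_new=30, n_iter=8, seed=0):
    """Refine the training set around the query and keep the best estimate."""
    rng = np.random.default_rng(seed)
    low, high = np.array([0.2, 0.5]), np.array([1.0, 2.0])
    params = rng.uniform(low, high, size=(n_init, 2))
    spectra = np.array([forward_model(p) for p in params])
    best, best_err = None, np.inf
    radius = 0.1 * (high - low)
    history = []
    for _ in range(n_iter):
        estimate = np.clip(lwlr_predict(query, spectra, params), low, high)
        err = np.linalg.norm(forward_model(estimate) - query)
        history.append(err)
        if err < best_err:
            best, best_err = estimate, err
        # Create new training points in the vicinity of the best estimate.
        new_params = np.clip(
            best + rng.normal(0.0, radius, size=(n_new, 2)), low, high)
        new_spectra = np.array([forward_model(p) for p in new_params])
        params = np.vstack([params, new_params])
        spectra = np.vstack([spectra, new_spectra])
        radius = radius * 0.5
    return best, best_err, history


true_theta = np.array([0.6, 1.2])
query = forward_model(true_theta)
best, best_err, history = active_fit(query)
```

The loop records the spectral residual of every iteration in `history` and returns the estimate with the smallest residual, so the returned fit is never worse than the first one. How quickly the residual falls depends on the forward model and on the sampling radius, both of which are arbitrary choices in this sketch.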

Our experimental results, using two sets of theoretical 
models and different levels of noise, show that even with low 
quality (S/N=5) data the algorithm provides a good estimate 
of the population ages, proportions and reddening. For our 
method, in about 80% of the cases the error in the age 
determination is equal to or less than one age step. In general, 
the results obtained from both the Padova and the Granada 
models support the conclusion that the best method when 
dealing with low S/N data seems to be the combination of an 
ensemble and prior knowledge. Another important feature of 
this method is its high speed: it takes ~10 seconds on a 
normal PC to estimate the parameters of a single 20,000 pixel 
spectrum. This represents a great advantage over other more 
conventional methods proposed for this problem, which may 
take up to a couple of hours to find the solution for such a 
spectrum. 

We will continue our efforts to improve parameter 
estimation of stellar populations. In forthcoming papers we 
will experiment with models of different metallicities, 
adapting this method to that problem. We will also explore 
different methods for exploiting prior knowledge and apply 
them to large spectral databases (e.g. SDSS). 

Based on this experimental evaluation we conclude that 
this method can be applied with similar success to "real" 
galaxies, reducing the computational cost and thus providing 
the capability of analyzing large quantities of astronomical 
spectroscopic data. 

Figure 20. Distribution of prediction errors in the reddening 
parameters r1, r2 and r3 (see Section 2.1) for the Granada models 
using an S/N=5. Figures (a) to (c) are the predictions of active 
learning and the original data for young, intermediate and old 
populations respectively. Figures (d) to (f) are the predictions 
of active learning with the fluxes of the central lines of the Lick 
indices magnified by a constant k = 4. 

Figure 21. Distribution of prediction errors in the age prediction 
of intermediate and old populations for the Granada models using 
an S/N=5. Figure (a), intermediate age, and (b), old, are the 
predictions of the active learning algorithm. Figures (c) and (d) 
are the predictions of the ensemble combining prior knowledge. 

Figure 22. Distribution of prediction errors in the relative 
contribution parameters c1, c2 and c3 (see Section 2.1) for the 
Granada models using an S/N=5. Figures (a) to (c) are the 
predictions of active learning and the original data for young, 
intermediate and old populations respectively. Figures (d) to (f) 
are the predictions of the active learning ensemble combining 
prior knowledge. 

Figure 23. Distribution of prediction errors in the reddening 
parameters r1, r2 and r3 (see Section 2.1) for the Granada models 
using an S/N=5. Figures (a) to (c) are the predictions of active 
learning and the original data for young, intermediate and old 
populations respectively; figures (d) to (f) are predictions of the 
ensemble combining prior knowledge. 

Table 3. Frequency table for the prediction of ages using the Padova models and an S/N=5. 

Table 4. Frequency table for the prediction of relative contributions using the Padova models and an S/N=5. 

Table 5. Frequency table for the prediction of reddening parameters using the Padova models and an S/N=5. 






Figure 24. Graphical comparison of results using the Granada 
models and noisy data, ratio S/N=5. Figure (a) from top to bot- 
tom and shifted by a constant to aid visualization: noisy test spec- 
trum, spectrum recovered using active learning and the original 
data and spectrum recovered using active learning combining pre- 
dictions. Figures (b) and (c) show the relative difference between 
test spectrum and predicted spectra in the same listed order. 

Figure 25. Graphical comparison of results using the Padova 
models and noisy data, ratio S/N=5. Figure (a) from top to bot- 
tom and shifted by a constant to aid visualization: noisy test spec- 
trum, spectrum recovered using active learning and the original 
data and spectrum recovered using active learning combining pre- 
dictions. Figures (b) and (c) show the relative difference between 
test spectrum and predicted spectra in the same listed order. 